I have choosen the Starbucks dataset from TidyTuesday. I like coffee and I know some basic things about it (like it has caffeine, you can add milk, you can have a more concentrated espresso or a “long one”, an Americano), but I have seen some data that I am not very familiar with. The goal is to see what variables predict some attributes for different Sturbucks beverages.
I have loaded the necessary packages. I have came back here every time I needed an other one
library(tidyverse)
library(corrplot)
library(performance)
library(ggpubr)
library(broom)
library(car)
library(multcompView)
starbucks <- read.csv(url("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-12-21/starbucks.csv"))
View(starbucks)
I checked the type of variables I have, I am going to change them if it seems necessary
str(starbucks)
## 'data.frame': 1147 obs. of 15 variables:
## $ product_name : chr "brewed coffee - dark roast" "brewed coffee - dark roast" "brewed coffee - dark roast" "brewed coffee - dark roast" ...
## $ size : chr "short" "tall" "grande" "venti" ...
## $ milk : int 0 0 0 0 0 0 0 0 0 0 ...
## $ whip : int 0 0 0 0 0 0 0 0 0 0 ...
## $ serv_size_m_l : int 236 354 473 591 236 354 473 591 236 354 ...
## $ calories : int 3 4 5 5 3 4 5 5 3 4 ...
## $ total_fat_g : num 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 ...
## $ saturated_fat_g: num 0 0 0 0 0 0 0 0 0 0 ...
## $ trans_fat_g : num 0 0 0 0 0 0 0 0 0 0 ...
## $ cholesterol_mg : int 0 0 0 0 0 0 0 0 0 0 ...
## $ sodium_mg : int 5 10 10 10 5 10 10 10 5 5 ...
## $ total_carbs_g : int 0 0 0 0 0 0 0 0 0 0 ...
## $ fiber_g : int 0 0 0 0 0 0 0 0 0 0 ...
## $ sugar_g : int 0 0 0 0 0 0 0 0 0 0 ...
## $ caffeine_mg : int 130 193 260 340 15 20 25 30 155 235 ...
Checking for any coding mistakes or extraordinary data mainly with graphs (pairing the variables and checking two at a time - also good for exploring any possible connection or correlation). Wherever I find an extreme outlier I check the variable type and will filter out if it does not make sense
ggplot(starbucks, aes(x = serv_size_m_l, y = calories)) +
geom_point () +
geom_smooth(method = "lm", se = FALSE)
ggplot(starbucks, aes(x = total_fat_g, y = saturated_fat_g)) +
geom_point()
ggplot(starbucks, aes(x = trans_fat_g, y = cholesterol_mg)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
ggplot(starbucks, aes(x = sodium_mg, y = total_carbs_g)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
ggplot(starbucks, aes(x = fiber_g, y = sugar_g)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
ggplot(starbucks, aes(caffeine_mg, calories)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
Service size equals zero does not seem to be right, I filtered them out.
Zero fat seems to be possible, I have not changed those.
There is an item with extreme high trans fat (2mg), I filtered it out. I do not really know if it is possible, but can be a quite big effect on a fitted model, so I have decided to extract it.
Other variables and values seemed to be right.
starbucks <- starbucks %>%
filter(serv_size_m_l > 0, trans_fat_g < 2)
I would like the whip to be a factor and rename the levels (0 becomes no, 1 becomes yes). Also checking what I did during the process
starbucks <-
starbucks %>%
mutate(whip = as.factor(whip))
is.factor(starbucks$whip)
## [1] TRUE
starbucks <- starbucks %>%
mutate(whip = fct_recode(whip, 'no' = '0', 'yes' ='1'))
levels(starbucks$whip)
## [1] "no" "yes"
I would also like size to be a factor, because it has definite levels and it is better as factor then as character to work with
starbucks <- starbucks %>%
mutate(size = as.factor(size))
levels(starbucks$size)
## [1] "grande" "short" "tall" "trenta" "venti"
I saw some not so flat lines when I made the plots, so I am checking the correlations between all the numeric variables (for this I need to filter out the two factors and the character name)
summary(starbucks)
## product_name size milk whip serv_size_m_l
## Length:1115 grande:333 Min. :0.000 no :836 Min. :236.0
## Class :character short :123 1st Qu.:1.000 yes:279 1st Qu.:354.0
## Mode :character tall :318 Median :3.000 Median :473.0
## trenta: 21 Mean :2.529 Mean :474.2
## venti :320 3rd Qu.:4.000 3rd Qu.:591.0
## Max. :5.000 Max. :887.0
## calories total_fat_g saturated_fat_g trans_fat_g
## Min. : 0.0 Min. : 0.00 Min. : 0.000 Min. :0.0000
## 1st Qu.:140.0 1st Qu.: 1.50 1st Qu.: 0.300 1st Qu.:0.0000
## Median :220.0 Median : 4.50 Median : 3.000 Median :0.0000
## Mean :234.3 Mean : 6.35 Mean : 3.984 Mean :0.1226
## 3rd Qu.:320.0 3rd Qu.:10.00 3rd Qu.: 7.000 3rd Qu.:0.2000
## Max. :640.0 Max. :28.00 Max. :20.000 Max. :0.5000
## cholesterol_mg sodium_mg total_carbs_g fiber_g
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Min. :0.0000
## 1st Qu.: 0.00 1st Qu.: 75.0 1st Qu.:22.0 1st Qu.:0.0000
## Median : 5.00 Median :135.0 Median :38.0 Median :0.0000
## Mean :15.63 Mean :143.5 Mean :38.7 Mean :0.8897
## 3rd Qu.:30.00 3rd Qu.:200.0 3rd Qu.:54.0 3rd Qu.:1.0000
## Max. :75.00 Max. :370.0 Max. :96.0 Max. :9.0000
## sugar_g caffeine_mg
## Min. : 0.00 Min. : 0.00
## 1st Qu.:19.00 1st Qu.: 30.00
## Median :34.00 Median : 75.00
## Mean :35.96 Mean : 89.66
## 3rd Qu.:49.50 3rd Qu.:145.00
## Max. :89.00 Max. :475.00
starbucks_cor <- starbucks %>%
select(3, 5:14)
summary(starbucks_cor)
## milk serv_size_m_l calories total_fat_g
## Min. :0.000 Min. :236.0 Min. : 0.0 Min. : 0.00
## 1st Qu.:1.000 1st Qu.:354.0 1st Qu.:140.0 1st Qu.: 1.50
## Median :3.000 Median :473.0 Median :220.0 Median : 4.50
## Mean :2.529 Mean :474.2 Mean :234.3 Mean : 6.35
## 3rd Qu.:4.000 3rd Qu.:591.0 3rd Qu.:320.0 3rd Qu.:10.00
## Max. :5.000 Max. :887.0 Max. :640.0 Max. :28.00
## saturated_fat_g trans_fat_g cholesterol_mg sodium_mg
## Min. : 0.000 Min. :0.0000 Min. : 0.00 Min. : 0.0
## 1st Qu.: 0.300 1st Qu.:0.0000 1st Qu.: 0.00 1st Qu.: 75.0
## Median : 3.000 Median :0.0000 Median : 5.00 Median :135.0
## Mean : 3.984 Mean :0.1226 Mean :15.63 Mean :143.5
## 3rd Qu.: 7.000 3rd Qu.:0.2000 3rd Qu.:30.00 3rd Qu.:200.0
## Max. :20.000 Max. :0.5000 Max. :75.00 Max. :370.0
## total_carbs_g fiber_g sugar_g
## Min. : 0.0 Min. :0.0000 Min. : 0.00
## 1st Qu.:22.0 1st Qu.:0.0000 1st Qu.:19.00
## Median :38.0 Median :0.0000 Median :34.00
## Mean :38.7 Mean :0.8897 Mean :35.96
## 3rd Qu.:54.0 3rd Qu.:1.0000 3rd Qu.:49.50
## Max. :96.0 Max. :9.0000 Max. :89.00
str(starbucks_cor)
## 'data.frame': 1115 obs. of 11 variables:
## $ milk : int 0 0 0 0 0 0 0 0 0 0 ...
## $ serv_size_m_l : int 236 354 473 591 236 354 473 591 236 354 ...
## $ calories : int 3 4 5 5 3 4 5 5 3 4 ...
## $ total_fat_g : num 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 ...
## $ saturated_fat_g: num 0 0 0 0 0 0 0 0 0 0 ...
## $ trans_fat_g : num 0 0 0 0 0 0 0 0 0 0 ...
## $ cholesterol_mg : int 0 0 0 0 0 0 0 0 0 0 ...
## $ sodium_mg : int 5 10 10 10 5 10 10 10 5 5 ...
## $ total_carbs_g : int 0 0 0 0 0 0 0 0 0 0 ...
## $ fiber_g : int 0 0 0 0 0 0 0 0 0 0 ...
## $ sugar_g : int 0 0 0 0 0 0 0 0 0 0 ...
correlations <- starbucks_cor %>%
cor()
correlations
## milk serv_size_m_l calories total_fat_g saturated_fat_g
## milk 1.00000000 -0.04110158 0.3727522 0.4933721 0.4834919
## serv_size_m_l -0.04110158 1.00000000 0.4574138 0.1738055 0.1588493
## calories 0.37275223 0.45741382 1.0000000 0.8027326 0.7609900
## total_fat_g 0.49337212 0.17380553 0.8027326 1.0000000 0.9649116
## saturated_fat_g 0.48349190 0.15884935 0.7609900 0.9649116 1.0000000
## trans_fat_g 0.35620533 0.12601432 0.6642680 0.8389354 0.7718382
## cholesterol_mg 0.30092362 0.13880467 0.7196849 0.8680969 0.8051751
## sodium_mg 0.33162104 0.41940461 0.8304040 0.5642401 0.5531148
## total_carbs_g 0.23947357 0.54374898 0.9162816 0.5259243 0.5072440
## fiber_g 0.11222790 0.11115593 0.3136440 0.2506472 0.1869549
## sugar_g 0.22591379 0.53813187 0.8933887 0.5011799 0.4901659
## trans_fat_g cholesterol_mg sodium_mg total_carbs_g fiber_g
## milk 0.3562053 0.3009236 0.3316210 0.2394736 0.1122279
## serv_size_m_l 0.1260143 0.1388047 0.4194046 0.5437490 0.1111559
## calories 0.6642680 0.7196849 0.8304040 0.9162816 0.3136440
## total_fat_g 0.8389354 0.8680969 0.5642401 0.5259243 0.2506472
## saturated_fat_g 0.7718382 0.8051751 0.5531148 0.5072440 0.1869549
## trans_fat_g 1.0000000 0.9653259 0.4474555 0.4202644 0.1308541
## cholesterol_mg 0.9653259 1.0000000 0.4897653 0.4753452 0.1137174
## sodium_mg 0.4474555 0.4897653 1.0000000 0.8291630 0.1434920
## total_carbs_g 0.4202644 0.4753452 0.8291630 1.0000000 0.2358490
## fiber_g 0.1308541 0.1137174 0.1434920 0.2358490 1.0000000
## sugar_g 0.4121353 0.4675981 0.8328058 0.9907429 0.1172173
## sugar_g
## milk 0.2259138
## serv_size_m_l 0.5381319
## calories 0.8933887
## total_fat_g 0.5011799
## saturated_fat_g 0.4901659
## trans_fat_g 0.4121353
## cholesterol_mg 0.4675981
## sodium_mg 0.8328058
## total_carbs_g 0.9907429
## fiber_g 0.1172173
## sugar_g 1.0000000
I am visulazing the correlations to see it better
corrplot(correlations, method = "circle")
The diagonal is every variable with itself, they are a perfect +1 correlation as they should be. As it was predictable different measurments of fat (trans_fat_g, total_fat_g, saturated_fat_g) have a strong correlation (r = 0.77 - 0.96). Carbs (total_carbs_g) and sugar (sugar_g) also have a predicted correlation (r = 0.99), the almost perfect correaltion makes sense, since the coffees containing coffee, milk, whip and sugar the only and main source of carbs is sugar. Sugar and fat has a connection with calories, moderate between trans fat and calories (r = 0.66), and strong with the others (r = 0.76 - 0.89). Just by taking a glance at the matrix, it seems that calories have the most connection to the other variables, so later, I will try to fit a model to predict calories in coffees.
As I have decided to work on finding out what makes a coffee with more calories, I would also like to see if it is true that whiped creamed beverages and bigger beverages have more calories
I am using independent sample t-test to see if coffees with shipped cream are more heavy in calories thank the ones without Nullhypothesis is that there is no difference, alternative hypothesis is that the whipped creamed have more calories
I am checking the normality
ggdensity(starbucks$calories,
main = "Density plot of calories",
xlab = "Calories of coffees")
ggplot(starbucks, aes(x = calories, fill = whip)) +
geom_histogram()
It seems skewed so I am running a Shapiro-Wilk test
shapiro.test(starbucks$calories)
##
## Shapiro-Wilk normality test
##
## data: starbucks$calories
## W = 0.98115, p-value = 7.524e-11
The Shapiro-Wilk came back significant so I need a nonparametric test
I am calculating the model with robust procedure
whip_t_test <- wilcox.test(calories ~ whip, data = starbucks)
whip_t_test
##
## Wilcoxon rank sum test with continuity correction
##
## data: calories by whip
## W = 27426, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
It seems like we have a significant difference in the calories of whipped and not whipped coffees. Z = 27426, p < 0.001
I checked a boxplot to see graphically which coffees have more calories
plot(calories ~ whip, data = starbucks)
It is the whipped, so we can state that coffees with whipped cream contain more calories (mean = 371.22 vs. mean of the no whipped cream coffees = 188.54)
Now I take a look at the size of a coffee related to its calories I am bulding a one-way ANOVA model to find out if there are any differences between the groups, and first I check the assumptions My nullhypothesis is that there are no differences My alternative hypothesis is that there are differences
Checking for outliers graphically
ggboxplot(starbucks, x = "size", y = "calories",
ylab = "Calories", xlab = "Size")
outlier <- starbucks %>%
filter (size == "trenta", calories < 10)
outlier
## product_name size milk whip serv_size_m_l calories total_fat_g
## 1 Cold Brewed Coffee trenta 0 no 887 5 0
## saturated_fat_g trans_fat_g cholesterol_mg sodium_mg total_carbs_g fiber_g
## 1 0 0 0 25 0 0
## sugar_g caffeine_mg
## 1 0 330
I have found an outlier in size “trenta” which turned out to be a cold brewed coffee and it makes sense, since as far as I know it does not contain any milk or sugar, so I left it in the data
I am checking the homogenity of variances as an assumption.
I would like them to be not significantly different (I have checked both the Levene’s and the Bartlett)
leveneTest(calories ~size, starbucks)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 4 33.788 < 2.2e-16 ***
## 1110
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
bartlett.test(calories ~size, starbucks)
##
## Bartlett test of homogeneity of variances
##
## data: calories by size
## Bartlett's K-squared = 139.67, df = 4, p-value < 2.2e-16
Unfortunately they are significantly different I need to use nonparametric or robust ways of ANOVA (I found a helpful article here)
starbucks_anova <- kruskal.test(calories ~ size, data = starbucks)
starbucks_anova
##
## Kruskal-Wallis rank sum test
##
## data: calories by size
## Kruskal-Wallis chi-squared = 273.35, df = 4, p-value < 2.2e-16
I found a significant model (H(4) = 273.35, p < 0.001)
To find out which groups differ from each other I have run a pairwise t-test with Bonferroni correction.
pairwise.t.test(starbucks$calories, starbucks$size, p.adj = "bonf")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: starbucks$calories and starbucks$size
##
## grande short tall trenta
## short < 2e-16 - - -
## tall 1.5e-11 1.5e-06 - -
## trenta 0.13 0.17 1.00 -
## venti 1.0e-13 < 2e-16 < 2e-16 2.3e-06
##
## P value adjustment method: bonferroni
It looks like short and grande, tall and grande, tall and short are different in calories. Also venti versus every else size are different. (each p < 0.001). Trenta only differs from venti.
I have produced a boxplot to visualize it
boxplot(calories ~ size, data = starbucks)
Lastly I wanted to fit a complex model to predict calories in a coffee. I contained every variable that seemed to be correlated to calories.
First I built a less complex model (with obvious elements like sugar and whip) and checked the assumptions.
model1 <- lm(calories ~ whip + sugar_g + total_fat_g, starbucks)
summary(model1)
##
## Call:
## lm(formula = calories ~ whip + sugar_g + total_fat_g, data = starbucks)
##
## Residuals:
## Min 1Q Median 3Q Max
## -88.040 -16.222 -4.301 9.134 114.078
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.14945 1.42020 13.484 <2e-16 ***
## whipyes -21.27456 2.53915 -8.379 <2e-16 ***
## sugar_g 4.03154 0.03782 106.608 <2e-16 ***
## total_fat_g 11.88029 0.19598 60.619 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.03 on 1111 degrees of freedom
## Multiple R-squared: 0.9684, Adjusted R-squared: 0.9683
## F-statistic: 1.136e+04 on 3 and 1111 DF, p-value: < 2.2e-16
check_model(model1)
model1 %>%
augment() %>%
arrange(desc(.cooksd)) %>%
head()
## # A tibble: 6 × 10
## calories whip sugar_g total_fat_g .fitted .resid .hat .sigma .cooksd
## <int> <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 390 yes 45 10 298. 91.9 0.00461 23.9 0.0170
## 2 430 yes 58 10 351. 79.5 0.00512 23.9 0.0141
## 3 230 yes 47 11 318. -88.0 0.00418 23.9 0.0141
## 4 450 no 69 17 499. -49.3 0.0121 24.0 0.0130
## 5 430 yes 37 18 361. 69.1 0.00551 24.0 0.0115
## 6 400 yes 35 16 329. 70.9 0.00469 23.9 0.0103
## # … with 1 more variable: .std.resid <dbl>
Althought independent variables does not correlate highly, we know from the correlation matrix, that they do correlate with calories (dependent variable). If it wouldn’t I would compare this with the following more complex model:
model2 <- lm(calories ~ whip + sugar_g + total_fat_g + total_carbs_g, starbucks)
I am looking for variables that are not too highly correlated with calories.
First a simple model
model3 <- lm(calories ~ milk + serv_size_m_l, starbucks)
summary(model3)
##
## Call:
## lm(formula = calories ~ milk + serv_size_m_l, data = starbucks)
##
## Residuals:
## Min 1Q Median 3Q Max
## -318.26 -76.80 -10.22 76.57 324.65
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -39.81520 11.60526 -3.431 0.000624 ***
## milk 31.62288 1.93165 16.371 < 2e-16 ***
## serv_size_m_l 0.40933 0.02071 19.765 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 107.9 on 1112 degrees of freedom
## Multiple R-squared: 0.3628, Adjusted R-squared: 0.3617
## F-statistic: 316.6 on 2 and 1112 DF, p-value: < 2.2e-16
check_model(model3)
The assumptions are met and I got a significant model (F(1112,2)=316.6, p < 0.001), where all of the predictors are significant (p < 0.001). Adjusted R^2 is 0.362, meaning that my predictors explain 36% of the calories variances.
I am bulding a more complex model
model4 <- lm(calories ~ milk + serv_size_m_l + fiber_g, starbucks)
summary(model4)
##
## Call:
## lm(formula = calories ~ milk + serv_size_m_l + fiber_g, data = starbucks)
##
## Residuals:
## Min 1Q Median 3Q Max
## -297.22 -71.47 -14.26 65.62 343.29
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -41.05071 11.16212 -3.678 0.000247 ***
## milk 29.52137 1.87076 15.780 < 2e-16 ***
## serv_size_m_l 0.38700 0.02005 19.297 < 2e-16 ***
## fiber_g 19.26499 2.01721 9.550 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 103.8 on 1111 degrees of freedom
## Multiple R-squared: 0.4111, Adjusted R-squared: 0.4096
## F-statistic: 258.6 on 3 and 1111 DF, p-value: < 2.2e-16
check_model(model4)
The assumptions are met again and I have a significant model (f(1111, 3) = 258.6, p < 0.001), where all service size, milk and fiber are significant (p < 0.001). Adjusted R^2 got higher, now it is 41% explained of the calories variances.
I am comparing the models without and with caffeine.
compare <- anova(model3, model4)
compare
## Analysis of Variance Table
##
## Model 1: calories ~ milk + serv_size_m_l
## Model 2: calories ~ milk + serv_size_m_l + fiber_g
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1112 12947653
## 2 1111 11965348 1 982305 91.208 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Turns out that model4 with fiber in it is a significantly better choice to predict calories in a coffe. (F(1) = 91.208, p < 0.001)
I wanted to see the relation between size, whipped cream and calories of coffees.
starbucks2 <- starbucks %>%
add_count(calories) %>%
group_by(size, whip) %>%
mutate(mean_cal = mean(n))
ggplot(starbucks2, aes(x = reorder (size, -mean_cal), y = mean_cal, fill = whip)) +
geom_col(position = "dodge") +
scale_fill_brewer() +
theme_classic() +
labs (title = "Mean calories of Starbucks coffees by size and whipcream", x = "Size", y = "Calories") +
theme (legend.position ="bottom")
It is interesting, because with the t-test I saw that whipped coffees have more calories, but here in subgroups it is not that obvious.
I have also made a plot about average calories in different sized Starbucks drinks
starbucks_to_plot <- starbucks %>%
group_by(size) %>%
count(mean(calories))
ggplot(starbucks_to_plot, aes(x = reorder(size, -n), y = n, fill = size)) +
geom_col() +
scale_fill_brewer() +
theme(legend.position = "None") +
labs(title = "Mean calories of Starbucks coffees by size", x = "Size", y = "Calories")